Abstract

Background: The emergence of Next Generation Sequencing technologies has made it possible for individual 
investigators to generate gigabases of sequencing data per week. Effective analysis and manipulation of these data is 
limited due to large file sizes, so even simple tasks such as data filtration and quality assessment have to be performed in 
several steps. This requires (potentially problematic) interaction between the investigator and a bioinformatics/ 
computational service provider. Furthermore, such services are often performed using specialized computational facilities. 

Results: We present a Windows-based application, Slim-Filter, designed to interactively examine the statistical properties of 
sequencing reads produced by the Illumina Genome Analyzer and to perform a broad spectrum of data manipulation tasks 
including: filtration of low quality and low complexity reads; filtration of reads containing undesired subsequences 
(such as parts of adapters and PCR primers used during the sample and sequencing library preparation steps); excluding 
duplicated reads (while keeping each read's copy number information in a specialized data format); and sorting reads by 
copy number, allowing for easy access and manual editing of the resulting files. Slim-Filter is organized as a sequence of 
windows summarizing the statistical properties of the reads. Each data manipulation step has roll-back abilities, allowing 
for return to previous steps of the data analysis process. Slim-Filter is written in C++ and is compatible with fasta, fastq, 
and specialized AS file formats presented in this manuscript. Setup files and a user's manual are available for download at 
the supplementary web site (https://www.bioinfo.uh.edu/Slim_Filter/). 

Conclusion: The presented Windows-based application has been developed with the goal of providing individual 
investigators with integrated sequencing reads analysis, curation, and manipulation capabilities. 



Background 

Next Generation Sequencing instruments such as the Illumina 
Genome Analyzer (Illumina Inc.), SOLID (Life Technologies), 
454 (Life Sciences), and Ion PGM™ and Ion Proton™ (Life 
Technologies) are able to produce dozens of gigabases of 
sequencing data per week. The Illumina Genome Analyzer and 
SOLID platforms produce large data files containing relatively 
short subsequences (reads) of equal size, usually 30-120 bases 
long. Each step in the sequencing process, such as sample 
preparation, library generation and base calling, can introduce 
significant biases and sequencing errors. For example, scanner 
calibration errors can cause one or more specific nucleotides 
to appear at skewed frequencies in the sequencing dataset 
(Figure 1a). A high concentration of adapters or PCR primers 
present during sample preparation can cause biases at the 
beginning of the reads (Figure 1b). Additionally, physical 
disturbances (such as vibration of the instrument) during the 
sequencing process can lead to a drop in the quality of one or 
more sequencing cycles (Figure 1c). 

A variety of platform-independent, as well as instrument-specific, 
applications have been developed to identify such biases and errors 
in the sequencing data [1-4]. Read manipulation tools have also 
been developed to perform tasks such as sorting, trimming, 
filtering, eliminating duplications, and read curation [5-9]. 

Figure 1 Typical biases in the position-by-position proportion of nucleotides and their average quality across reads produced by an Illumina GAIIx 
instrument: a) nucleotide A present in higher proportion across all positions; b) first 10+ positions are biased to adapter sequences; 
c) the overall average quality of the nucleotide G is lower across all positions and a sharp quality drop is observed near position 13. 



A major inconvenience, however, is that each phase in the data 
manipulation process and in the recalculation of the statistical 
characteristics of the datasets requires a separate application, 
so the analysis must be performed as a series of disconnected 
steps. The presented interactive Windows-based application 
(Slim-Filter) performs a variety of data manipulation tasks while 
simultaneously monitoring the statistical properties of the 
dataset and allowing users to save and/or undo (roll-back) each 
step of the analysis. A Linux version is also available; it has 
better performance, but lacks the interactive and visual 
capabilities of the Windows-based version. Slim-Filter is 
compatible with standard Illumina Genome Analyzer file formats 
(such as fasta and fastq), as well as the new AS (Array 
Subsequences) format presented here, which is designed to store 
only unique reads and their copy numbers in descending order 
(by copy number). 

Implementation 

Slim-Filter for Windows is implemented in VC++ and supports all 
64-bit Windows operating systems. Plots and diagrams that 
represent statistical properties use the .NET library and 
require Microsoft .NET Framework 3.5 (or a more recent version). 
The Linux version of Slim-Filter provides similar functionality, 
but is implemented as a one-step command line application. Setup 
files, example data, file format descriptions, and the user 
manual are available on the supplementary website 
(http://www.bioinfo.uh.edu/Slim_Filter). 

Windows-based interface 

Slim-Filter for Windows is organized as a chain of data 
manipulation steps, each of which can contain several tasks. The 
resulting new set of reads can be saved in fasta, fastq, or AS 
formats. Corresponding windows summarize statistical properties 
of the resulting reads (Figure 2). The sequential windows 
containing statistical summaries for each data manipulation step 
can remain open, allowing for a before and after comparison of 
the datasets. Slim-Filter allows the user to return (roll-back) 
to any of the previous steps of the analysis and proceed with a 
different set of tasks and/or task parameters. The log of the 
step-by-step statistical analysis and the history of data 
manipulation is stored in a report text file. 
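
The chain-of-steps design with roll-back described above can be illustrated with a small sketch. This is not Slim-Filter's actual implementation (which is written in C++ and is not shown in this manuscript); the `Session` class, its method names, and the log wording are hypothetical and chosen only to make the idea concrete. The sketch keeps a full snapshot of the reads after every step, which is simple but memory-hungry for large datasets.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Reads = List[str]


@dataclass
class Session:
    """Hypothetical step history: one snapshot of the reads per applied step."""
    history: List[Reads]                      # history[0] is the input dataset
    log: List[str] = field(default_factory=list)

    @property
    def current(self) -> Reads:
        # The dataset produced by the most recent step.
        return self.history[-1]

    def apply(self, name: str, task: Callable[[Reads], Reads]) -> Reads:
        # Run one data manipulation step and remember its result.
        before = len(self.current)
        result = task(self.current)
        self.history.append(result)
        self.log.append(f"{name}: {before} -> {len(result)} reads")
        return result

    def roll_back(self, step: int) -> Reads:
        # Return to the state after `step` steps (0 means the original input).
        if not 0 <= step < len(self.history):
            raise IndexError(f"no such step: {step}")
        del self.history[step + 1:]
        self.log.append(f"roll-back to step {step}")
        return self.current


session = Session(history=[["ACGT", "ANNT", "AAAA"]])
session.apply("drop reads with N", lambda reads: [r for r in reads if "N" not in r])
session.roll_back(0)  # discard the step and continue from the original input
```

The `log` list plays the role of the report text file: every applied step and every roll-back leaves one line behind.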

Data manipulation 

Slim-Filter can perform eleven different data manipulation 
tasks. If the input data is available in the fastq file format, 
the user can (1) exclude reads with a quality score below a 
user-defined threshold, or (2) replace each low quality 
nucleotide with the "unknown nucleotide" symbol N. The other 
nine tasks do not require nucleotide quality information and can 
be performed on data in the fastq, fasta and AS formats. Each 
read that contains a single unknown nucleotide can be (3) 
replaced by four alternative reads where the N symbol is replaced 
by each of the four possible nucleotides (A, C, T, G). This 
tactic allows for the "recovery" of data from damaged sequencing 
runs such as the one shown in Figure 1c, where a single low 
quality cycle resulted in the presence of an unknown or low 
quality nucleotide in the middle of each read. After "recovery", 
resulting reads can still be used by most de-novo assembly and 
SNP detection programs. 
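
Tasks (2) and (3) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used by Slim-Filter: the function names are hypothetical, and the quality string is assumed to use the Phred+33 ASCII encoding (older Illumina pipelines used an offset of 64, which can be passed as `offset`).

```python
def mask_low_quality(seq, qual, threshold, offset=33):
    """Task (2): replace each base whose quality is below `threshold` with N.

    `offset` is the ASCII offset of the quality encoding (assumed Phred+33).
    """
    return "".join(
        base if ord(q) - offset >= threshold else "N"
        for base, q in zip(seq, qual)
    )


def expand_single_n(seq):
    """Task (3): a read with exactly one N becomes four alternative reads.

    Reads with no N, or with more than one N, are returned unchanged.
    """
    if seq.count("N") != 1:
        return [seq]
    i = seq.index("N")
    return [seq[:i] + base + seq[i + 1:] for base in "ACTG"]


masked = mask_low_quality("ACGT", "II#I", threshold=20)  # '#' encodes quality 2
alternatives = expand_single_n(masked)
```

Here the third base is masked because its quality falls below the threshold, and the masked read is then expanded into one candidate read per nucleotide.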

Figure 2 Statistical characteristics of the set of reads: a) GC content distribution, b) entropy of individual reads distribution, c) average 
position-by-position nucleotide quality, d) position-by-position proportion of each nucleotide, e) reads copy number distribution, 
f) list of 50 most frequent reads, g) summary of the previously applied steps, h) filter options menu. 

A sharp decline in the quality of the last nucleotides or the 
presence of undesirable prefixes may require the user to (4) 
trim the prefixes and/or suffixes of all the reads in the 
dataset. This option is available at any step of the data 
analysis. Slim-Filter's filtration capabilities include: (5) the 
elimination of all reads containing odd or unknown characters; 
(6) the removal of low complexity reads (such as reads where a 
single nucleotide exhibits a proportion above a user defined 
threshold); (7) the exclusion of reads containing template 
subsequences; and (8) the interactive deletion of the most 
abundant reads in the dataset. Additional data manipulation 
features include: (9) the removal of duplicated and (10) reverse 
complementary reads (while maintaining their copy numbers), and 
(11) the sorting and storing of reads by order of ascending copy 
number and in the AS data format, which allows the user to easily 
view or edit the sequencing data files. 
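
A minimal sketch of tasks (6), (9), (10) and (11) follows. It is an illustration only: the function names and the default threshold are hypothetical, and the choice of the lexicographically smaller sequence as the representative of a read and its reverse complement is an assumption, since the text does not say which copy is kept. The sort direction is left as a parameter.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Complement every base and reverse the read."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_low_complexity(seq, max_fraction=0.8):
    """Task (6): True if one symbol exceeds `max_fraction` of the read."""
    if not seq:
        return True
    return max(Counter(seq).values()) / len(seq) > max_fraction


def collapse_reads(reads, merge_reverse_complements=True, descending=True):
    """Tasks (9)-(11): unique reads with copy numbers, sorted by copy number.

    Assumption: a read and its reverse complement are counted under the
    lexicographically smaller of the two sequences.
    """
    counts = Counter()
    for seq in reads:
        key = min(seq, reverse_complement(seq)) if merge_reverse_complements else seq
        counts[key] += 1
    sign = -1 if descending else 1
    # Ties in copy number are broken alphabetically to keep the output stable.
    return sorted(counts.items(), key=lambda item: (sign * item[1], item[0]))


reads = ["AAAC", "GTTT", "AAAC", "CCGG", "ACGT"]
unique = collapse_reads(reads)
```

In this example `GTTT` is the reverse complement of `AAAC`, so the two are merged into a single entry with copy number 3.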

Statistical assessment of reads 

Following each step, Slim-Filter recalculates and visualizes the 
statistical characteristics of the altered dataset in a new 
window (Figure 2a-g). The graphical output includes the (a) GC 
content and (b) entropy distributions for all reads in the 
dataset, as well as for only the unique reads (when the copy 
number is not considered). It also includes (c) the average 
position-by-position quality and (d) proportion of each 
nucleotide, (e) the copy number distribution of the reads, and 
(f) a list of the 50 most frequent reads. These graphical outputs 
have been chosen to visually detect the most frequent sequencing 
instrument errors, such as scanner calibration errors, quality 
drops due to mechanical disturbances, and sample and library 
preparation errors. The entropy of each individual read is 
calculated using the formula $E = -\sum_{i} p_i \log(p_i)$, where 
$p_i$ is the proportion of each of the considered symbols (the 
four nucleotides A, T, C, G, and N for unknown characters) in the 
given read. A short execution trace (g) is available for each 
window and contains a summary of the previously applied step. The 
pop-up filter option (Figure 2h) opens a list of tasks that can 
be performed at each given step of the analysis. The "Exclude 
Reads Containing Subsequences" option opens a new window, 
allowing the user to define subsequences to be used in the 
filtration process. At any point during the session, the user can 
save a detailed step-by-step report containing all data, which 
can be used to re-construct the presented graphs using 
third-party software such as Excel, Matlab, R, etc. 
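
The per-read statistics described above (GC content, entropy, and position-by-position nucleotide proportions) can be sketched as follows. The function names are hypothetical, and the logarithm base is an assumption: the formula in the text does not state it, so base 2 (entropy in bits) is used as the default here. The per-position function assumes reads of equal length, as stated for the supported datasets.

```python
import math
from collections import Counter


def gc_content(seq):
    """Fraction of G and C symbols in a read."""
    return sum(1 for base in seq if base in "GC") / len(seq)


def read_entropy(seq, base=2):
    """Entropy E = -sum(p_i * log(p_i)) over the symbols present in the read.

    The base of the logarithm is an assumption (2 gives bits).
    """
    n = len(seq)
    return sum(-(c / n) * math.log(c / n, base) for c in Counter(seq).values())


def position_proportions(reads):
    """Proportion of each symbol at each position across equal-length reads."""
    columns = [Counter() for _ in range(len(reads[0]))]
    for seq in reads:
        for i, symbol in enumerate(seq):
            columns[i][symbol] += 1
    return [{s: c / len(reads) for s, c in col.items()} for col in columns]


profile = position_proportions(["AC", "AG"])
```

A read made of a single repeated nucleotide has entropy 0, while a read with all four nucleotides in equal proportion reaches the maximum of 2 bits (when N is absent), which is why low entropy flags low complexity reads.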
Results and Discussion 

Linux and Windows versions 

The core code of Slim-Filter is written in standard C++ and is 
available in both Windows and Linux versions. The .NET chart 
controls are used to display statistical properties. For very 
large input datasets, given the limited amount of main memory on 
Windows desktop computers, performance will depend on the 
frequency of garbage collection performed by the Windows 
operating system. An estimate of the memory required for 
different read file sizes can be found on the supplementary 
website. 

The Linux version lacks a graphical user interface and is a 
single-step command line application. It also requires the full 
set of parameters to be present on the command line. 

Future work 

We see Slim-Filter as part of a line of Windows-based 
applications focused on moving basic Next Generation Sequencing 
data analysis tasks (quality assessment, search, mapping, and 
de-novo assembly) from specialized computational facilities to 
the investigator's desktop computer. Future versions of 
Slim-Filter will support paired-end reads as well as reads of 
flexible sizes generated by the 454 Life Sciences GS FLX and the 
Ion PGM™ and Ion Proton™ (Life Technologies) sequencing 
instruments. 

Conclusion 

Slim-Filter has been developed with the goal of providing 
individual investigators with integrated sequencing reads 
analysis, curation, and manipulation capabilities. It allows the 
user to assess quality, collect statistics, and perform basic 
data manipulation tasks on a set of short reads of equal sizes. 
Multiple error types such as biases at the beginning of the 
reads, damaged sequencing cycles, sequencing of adapters, and a 
drop in quality scores at any position in the reads can be 
treated or trimmed under both Windows and Linux versions of the 
program. Slim-Filter supports the fasta, fastq, and AS file 
formats. Setup files for Windows 64-bit operating systems and 
binary files for the Linux operating system are available at 
http://www.bioinfo.uh.edu/Slim_Filter. 

Availability and requirements 

Project name: Slim-Filter 

Project home page: http://www.bioinfo.uh.edu/Slim_Filter 

Operating system(s): Windows 64 bit, Linux 

Programming language: VC++/C++ 

Other requirements: Microsoft .NET Framework 3.5 or higher 

License: GNU 